{
  "metadata" : {
    "id" : "c8749075-25ae-48aa-a102-495a26a92d54",
    "name" : "batch_weblogs",
    "user_save_timestamp" : "1970-01-01T01:00:00.000Z",
    "auto_save_timestamp" : "1970-01-01T01:00:00.000Z",
    "language_info" : {
      "name" : "scala",
      "file_extension" : "scala",
      "codemirror_mode" : "text/x-scala"
    },
    "trusted" : true,
    "sparkNotebook" : null,
    "customLocalRepo" : null,
    "customRepos" : null,
    "customDeps" : null,
    "customImports" : null,
    "customArgs" : null,
    "customSparkConf" : null,
    "customVars" : null
  },
  "cells" : [ {
    "metadata" : {
      "id" : "A7ECD7E2EBF047A283DBADDCA0BB55E1"
    },
    "cell_type" : "markdown",
    "source" : "# Simple Weblog Analytics - The Batch Way\nIn this notebook, we are going to quickly visit a batch process of a series of weblog files to obtain the top trending pages per day."
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "4D776A6E557446488681874D1D710705"
    },
    "cell_type" : "code",
    "source" : [ "// This is the location of the unpackaged files. Update accordingly\n", "// You can unpack the provided dataset with:\n", "// tar xvf datasets/NASA-weblogs/nasa_dataset_july_1995.tgz -C /tmp/data/\n", "val logsDirectory = \"/tmp/data/nasa_dataset_july_1995\"\n", "val rawLogs = sparkSession.read.json(logsDirectory)" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "id" : "A60148A17B904A72A421640BB8E87474"
    },
    "cell_type" : "markdown",
    "source" : "## We define a schema for the data in the logs\nFollowing the formal description of the dataset (at: [NASA-HTTP](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) ), the log is structured as follows:\n\n>The logs are an ASCII file with one line per request, with the following columns:\n- host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.\n- timestamp in the format \"DAY MON DD HH:MM:SS YYYY\", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.\n- request given in quotes.\n- HTTP reply code.\n- bytes in the reply.\n\nThe dataset provided for this exercise offers this data in JSON format"
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "82F9E2879A1747EF8CC36B5E5D64E701"
    },
    "cell_type" : "code",
    "source" : [ "import java.sql.Timestamp\n", "case class WebLog(host:String, \n", "                  timestamp: Timestamp, \n", "                  request: String, \n", "                  http_reply:Int, \n", "                  bytes: Long\n", "                 )" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "id" : "CB0651C3AB3747EF9CB38477A6E59201"
    },
    "cell_type" : "markdown",
    "source" : "## We convert the raw data to structured logs"
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "7AF0F566B58C49F195A3D1E4990BA3E6"
    },
    "cell_type" : "code",
    "source" : [ "import org.apache.spark.sql.functions._\n", "import org.apache.spark.sql.types.IntegerType\n", "val preparedLogs = rawLogs.withColumn(\"http_reply\", $\"http_reply\".cast(IntegerType))" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "833438CD8CA1471181AC509553B749C2"
    },
    "cell_type" : "code",
    "source" : [ "val weblogs = preparedLogs.as[WebLog]" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : true,
      "id" : "9313FDE7CE7548B180AE2A47CDBED56E"
    },
    "cell_type" : "markdown",
    "source" : "## Now, we have the data in a structured format and we can start asking the questions that interest us.\nWe have imported the data and transformed it using a known schema.  We can use this 'structured' data to create queries that provide insights in the behavior of the users. "
  }, {
    "metadata" : {
      "id" : "144FB0F8C3AE42328435A2A8F3C150AB"
    },
    "cell_type" : "markdown",
    "source" : "### As a first step, we would like to know how many records are contained in our dataset."
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "FBE9E30417204C9D83E8F1FE65AB6471"
    },
    "cell_type" : "code",
    "source" : [ "val recordCount = weblogs.count" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "id" : "BEA0D41E58324A479D2B91093864E6C7"
    },
    "cell_type" : "markdown",
    "source" : "### A common question would be, what was the most popular URL per day?\nWe first reduce the timestamp to the day of the year. We then group by this new 'day of year' column and the request url and we count over this aggregate. We finally order using descending order to get this top URLs first."
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "B65C7756F4EA47388884AB0DC4619CE8"
    },
    "cell_type" : "code",
    "source" : [ "val topDailyURLs = weblogs.withColumn(\"dayOfMonth\", dayofmonth($\"timestamp\"))\n", "                          .select($\"request\", $\"dayOfMonth\")\n", "                          .groupBy($\"dayOfMonth\", $\"request\")\n", "                          .agg(count($\"request\").alias(\"count\"))\n", "                          .orderBy(desc(\"count\"))" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "03E60D724B504A918E10873AEF0F0CF1"
    },
    "cell_type" : "code",
    "source" : [ "topDailyURLs.show()" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "presentation" : {
        "tabs_state" : "{\n  \"tab_id\": \"#tab2096092733-0\"\n}",
        "pivot_chart_state" : "{\n  \"hiddenAttributes\": [],\n  \"menuLimit\": 200,\n  \"cols\": [],\n  \"rows\": [],\n  \"vals\": [],\n  \"exclusions\": {},\n  \"inclusions\": {},\n  \"unusedAttrsVertical\": 85,\n  \"autoSortUnusedAttrs\": false,\n  \"inclusionsInfo\": {},\n  \"aggregatorName\": \"Count\",\n  \"rendererName\": \"Table\"\n}"
      },
      "id" : "F1CB4530E62149F08D39256D53BA1C07"
    },
    "cell_type" : "code",
    "source" : [ "topDailyURLs.take(10)\n" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "id" : "4F7AB40ADF1D4EFF9166CDB63A0DC77E"
    },
    "cell_type" : "markdown",
    "source" : "### Top hits are all images. What now?\nIt's not unusual to see that the top URLs are images commonly used across a site.\n\nOur true interest lies in the content pages generating most traffic. To find those, we first filter on `html` content \nand then proceed to apply the top aggregation we just learned.\n\nThe request field is a quoted sequence of `[HTTP_VERB] URL [HTTP_VERSION]`. We will extract the url and preserve only those ending in `.html`, `.htm` or no extension (directories). This is a simplification for the purpose of the exercise. "
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "7191956637C34DF28D24D2947E6999A8"
    },
    "cell_type" : "code",
    "source" : [ "val urlExtractor = \"\"\"^GET (.+) HTTP/\\d.\\d\"\"\".r\n", "val allowedExtensions = Set(\".html\",\".htm\", \"\")\n", "val contentPageLogs = weblogs.filter {log => \n", "  log.request match {                                        \n", "    case urlExtractor(url) => \n", "      val ext = url.takeRight(5).dropWhile(c => c != '.')\n", "      allowedExtensions.contains(ext)\n", "    case _ => false \n", "  }\n", "}\n", "                                      " ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "presentation" : {
        "tabs_state" : "{\n  \"tab_id\": \"#tab1029193174-0\"\n}",
        "pivot_chart_state" : "{\n  \"hiddenAttributes\": [],\n  \"menuLimit\": 200,\n  \"cols\": [],\n  \"rows\": [],\n  \"vals\": [],\n  \"exclusions\": {},\n  \"inclusions\": {},\n  \"unusedAttrsVertical\": 85,\n  \"autoSortUnusedAttrs\": false,\n  \"inclusionsInfo\": {},\n  \"aggregatorName\": \"Count\",\n  \"rendererName\": \"Table\"\n}"
      },
      "id" : "844604F254BF4450B2D60F09A766C444"
    },
    "cell_type" : "code",
    "source" : [ "val topContentPages = contentPageLogs.withColumn(\"dayOfMonth\", dayofmonth($\"timestamp\"))\n", "                          .select($\"request\", $\"dayOfMonth\")\n", "                          .groupBy($\"dayOfMonth\", $\"request\")\n", "                          .agg(count($\"request\").alias(\"count\"))\n", "                          .orderBy(desc(\"count\"))" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : false,
      "id" : "392A9C52AEC34D9F8EF709F5B407F0E1"
    },
    "cell_type" : "code",
    "source" : [ "topContentPages" ],
    "outputs" : [ ]
  }, {
    "metadata" : {
      "id" : "995C97DC2E9F4DBA95A832588B147843"
    },
    "cell_type" : "markdown",
    "source" : "We can see that the most popular page that month was `liftoff.html ` corresponding to the coverage of the launch of the Discovery shuttle, as documented on the NASA archives: https://www.nasa.gov/mission_pages/shuttle/shuttlemissions/archives/sts-70.html.\n\nIt's closely followed by `countdown/` the days prior ot the launch."
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : true,
      "id" : "8DBF5B380B654300B6064149A1B63ABC"
    },
    "cell_type" : "code",
    "source" : [ "" ],
    "outputs" : [ ]
  } ],
  "nbformat" : 4
}